The Metropolis-Hastings algorithm is a powerful MCMC algorithm that allows us to sample from complex target distributions. However, its proposal step, which usually involves sampling from a Gaussian distribution, can be highly inefficient, especially when the target distribution is high-dimensional. To explore the parameter space more efficiently, the Hamiltonian Monte Carlo (HMC) algorithm borrows the idea of a Hamiltonian system from physics and introduces an auxiliary momentum variable into its sampling scheme. Instead of sampling from the target distribution \(\pi(q)\), we now sample from \(\pi(q,p)\), the joint distribution of the parameter of interest and the auxiliary momentum variable. As a result, at each iteration, new samples are proposed according to the laws of a Hamiltonian system instead of a random walk.
A Hamiltonian system describes the dynamics between the position \(q\) and the momentum \(p\) of a particle. The Hamiltonian \(H(q,p) = U(q) + K(p)\), where \(U\) is the potential energy and \(K\) is the kinetic energy, is the total energy of a particle and is conserved over time. Suppose the position is \(d\)-dimensional; then for each component \(q_i\), we introduce a corresponding momentum variable \(p_i\).
In a Hamiltonian system, the movement of the particles over time is entirely governed by the Hamiltonian equations: \[\begin{align*} \frac{d q_i}{d t} &= +\frac{\partial H}{\partial p_i} =+\frac{\partial K}{\partial p_i} \\ \frac{d p_i}{d t} &= -\frac{\partial H}{\partial q_i} = - \frac{\partial U}{\partial q_i} \end{align*}\]
To solve the Hamiltonian equations (i.e., to propose samples of \(q\) and \(p\) that follow the Hamiltonian dynamics), we use a symplectic integrator, in particular the leapfrog integrator. Given starting values for \(q\) and \(p\), the leapfrog integrator updates them through the following steps: \[\begin{align*} \tilde p &\leftarrow p - (\epsilon/2)\nabla_q U(q) \\ \tilde q &\leftarrow q + \epsilon \nabla_p K(\tilde p) \\ \tilde p &\leftarrow \tilde p - (\epsilon/2)\nabla_q U(\tilde q) \end{align*}\]
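The three steps above can be sketched as a single update function. This is a minimal sketch, not from the original post: the helper name `leapfrog_step` and the gradient callables `grad_U` and `grad_K` (standing for \(\nabla_q U\) and \(\nabla_p K\)) are my own placeholders.

```python
import numpy as np

def leapfrog_step(q, p, grad_U, grad_K, eps):
    """One leapfrog update: half step in momentum, full step in
    position, then another half step in momentum."""
    p = p - 0.5 * eps * grad_U(q)   # half-step momentum update
    q = q + eps * grad_K(p)         # full-step position update
    p = p - 0.5 * eps * grad_U(q)   # half-step momentum update
    return q, p
```

For example, with the harmonic oscillator \(U(q) = q^2/2\), \(K(p) = p^2/2\), repeated calls nearly conserve the total energy \(H(q,p)\), which is exactly the symplectic property we want.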
To link the Hamiltonian to the probability distribution we wish to sample from, which is now \(\pi(q,p)\), we make the following definitions:
The (unnormalized) joint density of the parameter of interest and the momentum: \[\pi(q,p) \propto \exp(-H(q,p))\]
The potential energy function: \[U(q) \equiv -\mathcal{L}(q)\] where \(\mathcal{L}(q)\) is the log likelihood of the target distribution \(\pi(q)\), up to a normalizing constant.
The kinetic energy function: \[K(p) = \frac{1}{2} \ p^t p\] (This choice of kinetic energy is made for tutorial purposes; it could be more complicated.)
These definitions yield the following equalities: \[\begin{align*} \pi(q,p) &\propto \exp(-H(q,p)) \\ &= \exp(-U(q)-K(p)) \\ &= \underbrace{\exp(\mathcal{L}(q))}_{\propto\,\pi(q)} \cdot \exp\Big(-\frac{1}{2} \ p^t p\Big) \end{align*}\]
Also, the updating equations for the leapfrog integrator become: \[\begin{align*} \tilde p &\leftarrow p + (\epsilon/2)\nabla_q \mathcal{L}(q) \\ \tilde q &\leftarrow q + \epsilon \tilde p \\ \tilde p &\leftarrow \tilde p +(\epsilon/2)\nabla_q \mathcal{L}(\tilde q) \end{align*}\]
The Hamiltonian Monte Carlo algorithm is split into two steps at each iteration: a Gibbs step for the momentum \(\pi(p)\), and a Metropolis step for the joint distribution \(\pi(q,p)\). In the first step, we sample a random momentum \(p \sim N_d(0,I)\) (supposing the parameter of interest is \(d\)-dimensional). In the second step, we propose new values for \(q\) and \(p\) using the leapfrog integrator, then accept the proposal with probability: \[ \min\Big\{1,\frac{\pi(\tilde q,\tilde p)}{\pi(q,p)}\Big\} = \min\Big\{1, \frac{\exp(-H(\tilde q,\tilde p))}{\exp(-H(q,p))}\Big\} \] where \(\tilde q\) and \(\tilde p\) are the proposed values.
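One iteration of this two-step scheme can be sketched as follows, assuming the identity-mass kinetic energy above. The function and argument names (`hmc_iteration`, `logL`, `grad_logL`) are my own placeholders for \(\mathcal{L}(q)\) and \(\nabla_q \mathcal{L}(q)\).

```python
import numpy as np

def hmc_iteration(q, logL, grad_logL, eps, L, rng):
    """One HMC iteration: Gibbs step for the momentum, leapfrog
    proposal, then a Metropolis accept/reject on the joint density."""
    p = rng.standard_normal(q.shape)           # Gibbs step: p ~ N(0, I)
    q_new, p_new = q.copy(), p.copy()
    # Leapfrog proposal: L steps of size eps
    p_new = p_new + 0.5 * eps * grad_logL(q_new)
    for i in range(L):
        q_new = q_new + eps * p_new
        if i < L - 1:
            p_new = p_new + eps * grad_logL(q_new)
    p_new = p_new + 0.5 * eps * grad_logL(q_new)
    # H(q, p) = -L(q) + p^t p / 2; accept with prob min(1, exp(-dH))
    H_cur = -logL(q) + 0.5 * p @ p
    H_new = -logL(q_new) + 0.5 * p_new @ p_new
    if rng.random() < np.exp(H_cur - H_new):
        return q_new
    return q
```

For a standard normal target, `logL = lambda q: -0.5 * q @ q` and `grad_logL = lambda q: -q` would be the two required inputs.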
After getting the samples from \(\pi(q,p)\), we could then marginalize out the auxiliary momentum variables to obtain samples from the target distribution \(\pi(q)\).
Note that the acceptance probability becomes \(1\) if we could simulate the Hamiltonian system exactly, since the energy is conserved. Because the leapfrog integration is just an approximation, we need the Metropolis step to correct for the approximation error.
On the other hand, the Gibbs step for the momentum is necessary for the samples to jump between different energy level sets (Betancourt, 2017): \[H^{-1}(E) = \{ q,p\ |\ H(q,p) = E\}\] Without resampling the momentum, we would always be drawing samples on the same energy level set (since the leapfrog integrator only gives us samples under the same Hamiltonian).
Below is an outline of the algorithm (Hoffman and Gelman, 2014). There are two hyper-parameters for the leapfrog integrator: \(\epsilon\), the step size, and \(L\), the number of steps. The effects of these hyper-parameters will be discussed in the last section. In addition, we also have to compute the gradient of the log likelihood function \(\nabla_q \mathcal{L}(q)\), which can easily be done through automatic differentiation (available in probabilistic programming languages such as Pyro/Edward2/Stan/PyMC3).
Here I demonstrate sampling from a standard normal using HMC. We only need two pieces of information: \(\mathcal{L}(q)\) (up to a normalizing constant) for the Metropolis step, and \(\nabla_q \mathcal{L}(q)\) for the leapfrog integrator. In this example, the gradient \(\nabla_q \mathcal{L}(q)\) can easily be calculated by hand. We have:
\[\begin{align*} \mathcal{L}(q) &\propto -\frac{1}{2}q^2 \\ \nabla_q \mathcal{L}(q) &= -q \end{align*}\]
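Since the gradient is available in closed form, a quick finite-difference check (a throwaway sketch, with the point \(q = 0.7\) chosen arbitrarily) can confirm the derivation:

```python
# Sanity-check the hand-derived gradient of the standard-normal log
# likelihood against a central finite difference.
logL = lambda q: -0.5 * q ** 2
grad_logL = lambda q: -q

q, h = 0.7, 1e-5
fd_grad = (logL(q + h) - logL(q - h)) / (2 * h)  # numerical gradient
```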
Note that the Hamiltonian equations for this example are:
\[\begin{align*} \frac{d q}{d t} & =+\frac{\partial K}{\partial p} = p\\ \frac{d p}{d t} &= - \frac{\partial U}{\partial q} = \nabla_q \mathcal{L}(q) = -q \end{align*}\]
These equations have an analytical solution: \[\begin{align*} q(t) &= r \cos(a+t) \\ p(t) &= -r \sin(a+t) \end{align*}\] where \(r\) and \(a\) are constants.
The solution corresponds to a clockwise rotation around the origin in the \((q,p)\) plane (Neal, 2010).
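We can verify this numerically: a small-step leapfrog trajectory starting from \((q,p) = (1,0)\) (so \(r = 1\) and \(a = 0\)) should stay close to the exact rotation \(q(t) = \cos(t)\), \(p(t) = -\sin(t)\). A quick sketch:

```python
import math

# Leapfrog trajectory for the standard normal, where grad_q U(q) = q.
eps, n_steps = 0.01, 100
q, p = 1.0, 0.0          # start at (q, p) = (1, 0): r = 1, a = 0
for _ in range(n_steps):
    p -= 0.5 * eps * q   # half-step momentum
    q += eps * p         # full-step position
    p -= 0.5 * eps * q   # half-step momentum
t = eps * n_steps        # total integration time t = 1
# Compare (q, p) against the exact solution (cos(t), -sin(t)).
```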
Below is a demonstration of 4 iterations of the HMC sampler, with step size \(\epsilon= 0.2\) and number of steps \(L= 15\) (Also, all proposed values are accepted in the figure).
The first figure shows the evolution of position and momentum over 4 iterations. We could see that at the start of each iteration, the resampled momentum brings the sampler to a different energy level set.
The second figure shows the evolution of energy over 4 iterations. We could make two observations. First, the total energy (the Hamiltonian) is nearly conserved within each iteration since we chose a small step size. Second, the change in energy between iterations is caused entirely by the change in kinetic energy (due to the resampling of the momentum).
Here I compare the behaviour of an HMC sampler with that of an MH sampler on a bivariate normal example. The covariance matrix of the bivariate normal is: \(\begin{pmatrix}1 & 0.3 \\ 0.3 & 1\end{pmatrix}\).
Below is the first 100 HMC samples after 200 burn-in steps. I set \(\epsilon = 0.7\) and \(L = 9\).
Below are the first 100 MH samples after 200 burn-in steps. I set the standard deviation of the proposal Gaussian distribution to \(0.7\).
We could see that the HMC samples rarely stay at the same position, since the rejection rate is really low \((0.08)\). We could also tell that there are some patterns in the sampling path. On the other hand, the rejection rate for this MH sampler is about \(0.33\), so we sometimes see the samples staying at the same position. Also, the MH samples behave like a random walk.
The choice of hyper-parameters is critical for the efficiency of HMC. Both hyper-parameters appear in the leapfrog integrator: the step size \(\epsilon\), and the number of steps \(L\). A large \(\epsilon\) may lead to poor approximation to the solution of the Hamiltonian equations, hence resulting in a high rejection rate. On the other hand, an \(\epsilon\) that is too small may lead to a waste of computation time. If \(L\) is too small, the proposed value would be too close to the previous value. If \(L\) is too large, we might end up looping along the Hamiltonian trajectories, which also results in a waste of computation time.
Below I show two leapfrog integrators for the standard normal, one with \(\epsilon = 0.2\), and one with \(\epsilon = 0.5\). Both of them have \(L = 30\).
\(\epsilon = 0.2\):
\(\epsilon = 0.5\):
When setting \(\epsilon = 0.2\), we are almost perfectly simulating from the exact solutions. However, the number of steps seems to be a bit too large since we almost make a loop back to the start. The total energy is nearly conserved in this example, which should make the acceptance probability of this proposal close to 1.
When setting \(\epsilon = 0.5\), the approximation error becomes slightly visible, and we start looping (we could tell from the energy plot that we make a loop in around 13 steps). The total energy is clearly not conserved in this example. We could calculate the acceptance probability of the proposal, which is: \(\exp(H(q,p) - H(\tilde{q},\tilde{p})) = \exp(2.04-2.10) \approx 0.94\).
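The acceptance probability here is just \(\min\{1, \exp(-\Delta H)\}\); reproducing the arithmetic (the energies 2.04 and 2.10 are read off the example above):

```python
import math

# Acceptance probability from the change in total energy.
H_current, H_proposed = 2.04, 2.10
accept_prob = min(1.0, math.exp(H_current - H_proposed))  # ~0.94
```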
In practice, it is often hard to find a proper choice of \(\epsilon\) and \(L\). The No-U-Turn Sampler (NUTS), which is now commonly used in most probabilistic programming languages, is an extension of the HMC that eliminates the need to set \(L\) and tunes the \(\epsilon\) automatically.
Betancourt, M. (2017). A conceptual introduction to Hamiltonian Monte Carlo. arXiv preprint arXiv:1701.02434.
Hoffman, M. D. and Gelman, A. (2014). The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15:1593-1623.
Neal, R. M. (2010) MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo (eds S. Brooks, A. Gelman, G. Jones and X.‐L. Meng). Boca Raton: Chapman and Hall–CRC Press.